(57) ABSTRACT

A method for spatializing text content for enhanced visual
browsing and analysis. The invention is applied to large text
document corpora such as digital libraries, regulations and
procedures, archived reports, and the like. The text content
from these sources may be transformed to a spatial
representation that preserves informational characteristics from
the documents. The three-dimensional representation may
then be visually browsed and analyzed in ways that avoid
language processing and that reduce the analysts' effort.
BUILD AN ELECTRONIC DATABASE OF A PLURALITY OF DOCUMENTS

CREATE A PLURALITY OF HIGH DIMENSIONAL VECTORS, ONE FOR EACH OF THE
PLURALITY OF DOCUMENTS SUCH THAT EACH OF THE HIGH DIMENSIONAL
VECTORS REPRESENTS THE RELATIVE RELATIONSHIP OF THE INDIVIDUAL
DOCUMENTS TO A TOPIC ATTRIBUTE

ARRANGE THE HIGH DIMENSIONAL VECTORS INTO CLUSTERS, EACH OF THE
CLUSTERS REPRESENTING A PLURALITY OF DOCUMENTS GROUPED BY
THE RELATIVE SIGNIFICANCE OF THEIR RELATIONSHIP TO A TOPIC ATTRIBUTE

CALCULATE CENTROID COORDINATES AS THE CENTER OF MASS OF EACH
CLUSTER, THE CENTROID COORDINATES BEING STORED OR PROJECTED
IN A TWO-DIMENSIONAL PLANE

CONSTRUCT A VECTOR FOR EACH DOCUMENT, THE VECTOR CONTAINING THE
DISTANCE FROM THE DOCUMENT TO EACH CENTROID COORDINATE
IN HIGH-DIMENSIONAL SPACE

CREATE A PLURALITY OF TERM LAYERS, EACH OF THE TERM LAYERS
CORRESPONDING TO A DESCRIPTIVE TERM APPLIED TO EACH CLUSTER
AND IDENTIFYING x,y COORDINATES FOR EACH DOCUMENT
ASSOCIATED WITH EACH TERM LAYER

CREATE A Z COORDINATE ASSOCIATED WITH EACH TERM LAYER FOR EACH
x,y COORDINATE BY APPLYING A SMOOTHING FUNCTION TO THE x,y
COORDINATES FOR EACH DOCUMENT, AND SUPERIMPOSING UPON ONE
ANOTHER ALL OF THE TERM LAYERS
THREE-DIMENSIONAL DISPLAY OF
DOCUMENT SET

REFERENCE TO RELATED APPLICATION

This application is a continuation of application Ser. No.
09/235,463 filed on Jan. 22, 1999, now abandoned, which is
a continuation of application Ser. No. 08/695,455 filed on
Aug. 12, 1996, now abandoned, and which are hereby
incorporated by reference in their entirety.

This invention was made with Government support under 
Contract DE-AC06 76RLO 1830 awarded by the U.S. 
Department of Energy. The Government has certain rights in 
the invention. 
This invention relates generally to the field of information
storage and retrieval, or "information visualization". More
particularly, the invention relates to a novel method for
text-based information retrieval and analysis through the
creation of a visual representation for complex, symbolic
information. This invention also relates to a method of
stored information analysis that (i) requires no human
pre-structuring of the problem, (ii) is subject independent, (iii)
is adaptable to multi-media information, and (iv) is
constructed on a framework of visual presentation and human
interaction.
Current visualization approaches demonstrate effective
methods for visualizing mostly structured and/or hierarchical
information such as organization charts, directories,
entity-attribute relationships, and the like. Mechanisms to
permit free text visualizations have not yet been perfected.
The idea that open text fields themselves or raw prose might
be candidates for information visualization is novel. The
need to read and assess large amounts of text that is retrieved
through graph theory or figural displays as "visual query"
tools on document bases puts severe limits on the amount of
text information that can be processed by any analyst for any
purpose. At the same time, the amount of "open source"
digital information is increasing exponentially. Whether it be
for market analysis, global environmental assessment,
international law enforcement or intelligence for national
security, the analyst's task is to peruse large amounts of data
to detect and recognize informational 'patterns' and pattern
irregularities across the various sources.

True text visualizations that would overcome these time
and attentional constraints must represent textual content
and meaning to the analyst without them having to read it in
the manner that text normally requires. These visualizations
would instead result from a content abstraction and
spatialization of the original text document that would transform it
into a new visual representation conveying information by
image instead of prose.

Prior researchers have attempted to create systems for
analysis of large text-based information data bases. Such
systems have been built on Boolean queries, document lists
and time consuming human involvement in sorting, editing
and structuring. The simplification of Boolean function
expressions is a particularly well-known example of prior
systems. For example, in U.S. Pat. No. 5,465,308, a method
and apparatus for pattern recognition utilizes a neural
network to recognize two dimensional input images which are
sufficiently similar to a database of previously stored two
dimensional images. Images are first image processed and
subjected to a Fourier transform which yields a power
spectrum. An in-class to out-of-class study is performed on
a typical collection of images in order to determine the most
discriminatory regions of the Fourier transform. Feature
vectors are input to a neural network, and a query feature
vector is applied to the neural network to result in an output
vector, which is subjected to statistical analysis to determine
if a sufficiently high confidence level exists to indicate that
a successful identification has been made.

SUMMARY OF THE INVENTION 

The SPIRE (Spatial Paradigm for Information Retrieval 
and Exploration) software supports text-based information 
retrieval and analysis through the creation of a visual 
representation for complex, symbolic information. A pri- 
mary goal of SPIRE is to provide a fundamentally new 
visual method for the analysis of large quantities of infor- 
mation. This method of analysis involves information 
retrieval, characterization and examination, accomplished 
without human pre-structuring of the problem or pre-sorting
of the information to be analyzed. The process produces a 
visual representation of results. 

More specifically, the novel process provides a method of 
determining and displaying the relative content and context 
of a number of related documents in a large document set. 
The relationships of a plurality of documents are presented 
in a three-dimensional landscape with the relative size and 
height of a peak in the three-dimensional landscape repre- 
senting the relative significance of the relationship of a topic, 
or term, and the individual document in the document set. 
The steps of the process are: 

(a) constructing an electronic database of a plurality of
documents to be analyzed; 

(b) creating a plurality of high dimensional vectors, one 
for each of the plurality of documents, such that each of the 
high dimensional vectors represents the relative relationship 
of the individual documents to the term, or topic attribute; 

(c) arranging the high dimensional vectors into clusters, 
with each of the clusters representing a plurality of docu- 
ments grouped by relative significance of their relationship 
to a topic attribute; 

(d) calculating centroid coordinates as the center of mass 
of each cluster, the centroid coordinates being stored or 
projected in a two-dimensional plane; 

(e) constructing a vector for each document, with each 
vector containing the distance from the document to each 
centroid coordinate in high-dimensional space; 

(f) creating a plurality of term (or topic) layers, each of the 
term layers corresponding to a descriptive term (or topic) 
applied to each cluster, and identifying x,y coordinates for 
each document associated with each term layer; and 

(g) creating a z coordinate associated with each term layer 
for each x,y coordinate by applying a smoothing function to 
the x,y coordinates for each document, and superimposing 
upon one another all of the term layers.

BRIEF DESCRIPTION OF THE DRAWINGS 

The accompanying drawings, which are incorporated in 
and constitute a part of the specification, illustrate preferred 
embodiments of the invention, and together with the 
description, serve to explain the principles of the invention. 

FIG. 1 is a graphical representation of database relation- 
ships in two-dimensional space; 

FIG. 2 is a one dimensional representation of documents 
represented in FIG. 1; 
FIG. 3 is a smoothed version of the representation of FIG. 2;

FIG. 4 is a three-dimensional representation of a database
having small theme sets and high discrimination;

FIG. 5 is a three-dimensional representation of a database
having large theme sets and low discrimination; and

FIG. 6 is a block diagram presenting the sequence of steps in
the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE
INVENTION

As used herein, the following terms shall have the fol- 
lowing definitions: 

1. Information Retrieval means access and discovery of
stored information. It requires the efficient retrieval of
relevant information from ill-structured natural language-based
documents. The effectiveness of a retrieval method is
measured by both precision, or the proportion of relevant to
non-relevant documents identified, and recall, or the
percentage of relevant documents identified.

2. Information analysis is discovery and synthesis of
stored information. It involves the detection of information
patterns and trends and the construction of inferences
concerning these patterns and trends which produce knowledge.

The present invention is known as SPIRE (Spatial Paradigm
for Information Retrieval and Exploration). SPIRE is
a method of presenting information by relative relationships
of content and context, that is, the "relatedness" of a
plurality of documents to one another both by their sheer
numbers and by their subject matter. It is comprised of a
plurality of elements which define its usefulness as an
information analysis tool. Briefly, the elements are: a
combination of an intuitive and attractive interface, well
integrated with a powerful set of analytical tools; a
computationally efficient approach to both clustering and projection,
essential for large document sets; a three-dimensional
visualization component to render stored information in a
three-dimensional format (known as ThemeScapes); and a unique
interplay between the 2-dimensional and 3-dimensional
visualization components.

An essential first step in the transformation of natural
language text to a visual form is to extract and structure
information about the text through a "text processing
engine". A text processing engine for information
visualization requires: (1) the identification and extraction of
essential descriptors or text features, (2) the efficient and flexible
representation of documents in terms of these text features,
and (3) subsequent support for information retrieval and
visualization. There are a number of acceptable text engines
currently available on the market or as research prototypes,
such as the Hecht Nielson Corporation's MatchPlus or the
National Security Agency's Acquaintance.

The parameters typically measured by a text engine fall
into one of three general types. The first type is 'frequency-based
measures' on words, utilizing only first order statistics. The
presence and count of unique words in a document identifies
those words as a feature set. The second type of feature is
based on higher order statistics taken on the words or letter
strings. Here, the occurrence, frequency, and context of
individual words are used to characterize a set of explicit or
implicitly defined word classes. The third type of text feature
is semantic: the association between words is not defined
through analysis of the word corpus, as with statistical
features, but is defined a priori using knowledge of the
language. Semantic approaches may utilize natural or
quasi-natural language understanding algorithms.
The second requirement of the text engine (efficient and
flexible representation of textual information) is satisfied if
identified text features are used as a shorthand representation
of the original document. Instead of complex and unwieldy
strings of words, feature sets are the basis of document
representation. Volume reduction of information is required
to make later computations possible.

Finally, the text engine must provide easy, intuitive access
to the information contained within the corpus of documents
through retrieval and visualization. To provide efficient
retrieval, the text processing engine must pre-process
documents and efficiently implement an indexing scheme for
individual words or letter strings. Information retrieval
implies a query mechanism to support it, often a basic
Boolean search, or a high level query language, or the visual
manipulation of spatialized text objects in a display.
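
As a rough illustration of the indexing and Boolean search mentioned above, the following minimal sketch builds an inverted index over individual words and answers AND/OR queries. It is not taken from any particular text engine; the function names (`tokenize`, `build_index`, `boolean_and`, `boolean_or`) and the three sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Return ids of documents that contain at least one query term."""
    result = set()
    for t in terms:
        result |= index.get(t.lower(), set())
    return result


# Hypothetical sample corpus, for illustration only.
documents = {
    "d1": "Laser isotope separation for nuclear fuel",
    "d2": "Health physics and radiation dose",
    "d3": "Nuclear weapons and health effects",
}
index = build_index(documents)
```

With this index, `boolean_and(index, "nuclear", "health")` selects only the document that mentions both words, while `boolean_or` returns the union of the two posting sets.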

The process of the present invention can best be described 
with reference to a five-stage text visualization process. 

STAGE ONE The receipt of electronic versions of textual
documents into the text engine described above is essentially
independent of, but a required precursor for, the SPIRE
process. The documents are input as unprocessed
documents: no key wording, no topic extraction, no
predefined structure is necessary. In fact, the algorithms used to
create a spatial representation of the documents presuppose
the characteristics of natural language communication, so
that highly structured information (e.g. tables and outlines)
cannot be adequately processed and will result in diminished
results.

STAGE TWO The analysis of natural language docu- 
ments provides a characterization of the documents based on 
content. Performed in the text engine, the analysis can be 
first order (word counts and/or natural language understand- 
ing heuristics) or higher order information captured by 
Bayesian or neural nets. The required output is that each 
document must be converted to a high dimensional vector. 
A metric on the vector space, such as a Euclidean distance 
measure or cosine measure, can be used to determine the 
similarity of any two documents in the collection. The 
output of this processing stage is a high dimensional vector 
for each document in the collection. 
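
To make the Stage Two output concrete, here is a minimal sketch of the simplest first-order case: each document becomes a vector of word counts, and two documents are compared with a Euclidean distance or a cosine measure. The vocabulary construction and the helper names are assumptions for illustration; a real text engine would use a far richer feature set.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical three-document corpus.
docs = ["nuclear weapons treaty", "nuclear weapons test", "health physics dose"]
vocabulary = sorted({word for d in docs for word in d.lower().split()})
vectors = [term_vector(d, vocabulary) for d in docs]
```

The first two documents share two of their three words, so their cosine similarity is 2/3, while the third document shares no words with the first and scores 0.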

STAGE THREE The document vectors must be grouped
in the high dimensional metric space ("clustering"). In order
to satisfy performance requirements for large document sets,
clustering algorithms with a lower order of complexity are
essential. The output of this stage is a partition set on the
document collection with measures for each cluster of
magnitude (count) and dispersion. While it is believed that there
are a number of different approaches to the clustering of
information that will lead to acceptable results, Applicants
have chosen to distinguish between "large"
(more than 3,000 documents) and "small" (fewer than 3,000
documents) data sets. For small data sets, readily available
clustering algorithms have been used, with primary
emphasis on k-means and complete linkage hierarchical clustering.
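
For the small-data-set case, a generic k-means pass can be sketched as follows. This is a textbook formulation, not Applicants' implementation; the seed vectors are supplied by hand here, and the stopping tolerance `tol` is a hypothetical parameter.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(points):
    """Center of mass of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def k_means(vectors, seeds, max_iter=100, tol=1e-9):
    """Alternate nearest-centroid assignment and centroid update until stable."""
    centroids = [list(s) for s in seeds]
    assignment = []
    for _ in range(max_iter):
        # Assign every document vector to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Recompute each centroid; an empty cluster keeps its old centroid.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            new_centroids.append(centroid(members) if members else old)
        shift = max(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, assignment


# Hypothetical 2-D "document vectors" forming two obvious groups.
vectors = [[0, 0], [0, 1], [10, 10], [10, 11]]
centroids, assignment = k_means(vectors, seeds=[[0, 0], [10, 10]])
```

Each returned centroid is the center of mass of its cluster, which is the quantity the later projection stage works from.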

For larger data sets, traditional clustering algorithms cannot
be used because of the exponential complexity of the
clustering algorithms as the data set increases. Applicants
have therefore devised an alternative method for clustering
in large problem sets known as "Fast Divisive Clustering".
In this process, the user selects the desired number of
clusters. No assistance is provided in selecting this number,
but it should be heuristically based on knowledge of the data
set, such as size, diversity, etc. After the number of seeds has
been selected, the next step is to place seeds in the
multidimensional document space. A sampling of the subspaces is
performed to ensure that there is a reasonable distribution of
the cluster seeds, that is, they are not too close to one
another. Then, hyperspheres are defined around each
cluster seed, and all documents within a hypersphere are
assigned to the corresponding cluster. Iteratively, the center of
mass is calculated, yielding a new cluster centroid, and
therefore a new location for the hypersphere and new
document assignments. Within a few iterations, locations for
the cluster centroids will be determined, and the final
document-to-cluster assignments are made. Changes in
distances between iterations should remain within a
predefined threshold.

This third stage can be summarized as:

(i) selecting the number of seeds, based on characteristics
of the document collection;

(ii) placing seeds in hyperspace by sampling regions to
ensure reasonable distribution of seeds;

(iii) identifying non-overlapping hyperspheres (one for
each cluster) and assigning each document to a cluster
based on which hypersphere the document is located
within;

(iv) calculating a centroid coordinate, the center of
mass for each cluster; and

(v) repeating steps (iii) and (iv) until the changes between
iterations fall within a predefined threshold.

STAGE FOUR This stage requires the projection of the
high dimensional document vectors and the cluster centroids
produced in Stage 3 into a 2-dimensional representation
(FIG. 1). The 2-D planar representation of the documents
and clusters is necessary for user viewing and interaction.
Because the number of dimensions is reduced from
hundreds to two, a significant loss of information naturally
results. Some representational anomalies are produced by
projection, causing documents to be placed with an
associated error. The nature and quantity of this error are defining
characteristics of the chosen projection. As with the
clustering stage, compute time is important for large document
sets. Therefore, projection algorithms which are of a low
order of complexity are vital. The product of this stage is a
set of 2D coordinates, one coordinate pair (10,12) for each
document.

As with the clustering of Stage three, multiple options for
projection techniques are available. For relatively small data
sets Applicants have chosen the "Multi-dimensional
Scaling Algorithm", or MDS. The MDS utilizes pairwise
distances (Euclidean or cosine angle) between all document
pairs. The algorithm attempts to preserve the distances
determined in the high-dimensional space when projecting to 2D
space. In doing so, the discrepancy between pairwise
distances in the high dimensional space and the 2D
counterparts is represented as an error measure. The algorithm
iteratively adjusts document positions in the 2D plane in
order to minimize the associated error. The distance from
every point to every other point is considered and weighed
against a preset desired distance. Every point influences
every other point, making MDS a computationally intensive
algorithm.

For larger data sets, MDS is impractical due to the
exponential order of complexity, and Applicants have
therefore developed a projection algorithm called "Anchored
Least Stress". When starting with a fixed number of points
(cluster centroids which have been calculated in stage three),
the algorithm considers only the distance from a point to the
various cluster centroids, not the distance to every other
point. The document is placed so that its position reflects its
similarity or dissimilarity to every cluster centroid. Only a
relatively small amount of initial calculation is required;
after that each document can be positioned using simple
matrix operations, with a computational complexity on the
order of the number of cluster centroids. With the centroids
placed in the 2D plane, a vector is constructed for each
document which contains the distances from the document
to each cluster centroid in the high dimensional space. Given
the vector of hyperspace distances, a closed form solution
can be constructed which rapidly produces the 2D
coordinates of each document in the document collection.

More specifically, if one begins with n cluster centroids c_j
(the 2-dimensional projection of the cluster centroids from
high-dimensional space), assume the coordinate system is
such that the center of mass of all the cluster centroids is at
the origin.

$$c_{\cdot 1} = \frac{1}{n}\sum_{j=1}^{n} c_{j1}; \qquad c_{\cdot 2} = \frac{1}{n}\sum_{j=1}^{n} c_{j2} \qquad [1]$$

Translate the coordinates of the centroids as follows:

$$c_{j1}(\text{new}) = c_{j1}(\text{old}) - c_{\cdot 1}; \qquad c_{j2}(\text{new}) = c_{j2}(\text{old}) - c_{\cdot 2} \qquad [2]$$

The squared distance between each document i and each of
the centroids j (as measured in the original high-dimensional
space) is d_ij. There are m documents with
unknown 2-dimensional coordinates x_i. For each document i
and each centroid j, the desired relationship is

$$d_{ij} \approx \lVert x_i - c_j \rVert^{2} \qquad [3]$$

The average distance between the document and the
centroids is

$$\bar{d}_{i} = \frac{1}{n}\sum_{j=1}^{n} d_{ij} \qquad [4]$$

and [symbol illegible in source] is the unknown quantity

[equation illegible in source]   [5]

If it is desired to force documents to be closer to the
centroid of their own cluster, a weighted least squares
approach may be used. Let w be an input
weight; this is interpreted to mean that the distance of a point
from its own cluster centroid is w times more important than its
distance from the other centroids. Let S_j be a matrix with 0's
off-diagonal and 1's on the diagonal, except for the jth
diagonal element, which is w. Then the position of the ith
document, when that document is a member of the jth cluster,
will be

$$x_i = (C^{T} S_j C)^{-1} C^{T} S_j\, y_i \qquad [6]$$

The fourth stage can be summarized as:

(i) performing an anchored least stress analysis on cluster
centroid coordinates in hyperspace;

(ii) producing a vector for each document with distance
measures from the document to each cluster centroid; and

(iii) constructing an operator matrix and multiplying the
matrix by each vector in step (ii) to produce two-dimensional
coordinates for each document.

STAGE FIVE The output of Stage four (a coordinate pair
for each document and cluster centroid) is displayed in a
scatter plot yielding what Applicants call the "Galaxies"
two-dimensional visualization. For this two-dimensional
visualization, no further computation of the Stage Four
results is required. A three-dimensional representation of the
Stage Four results does require further computation, and
results in what Applicants call a thematic landscape, or
"ThemeScapes". This 3D representation provides an
intuitive visual measure and a spatial position in display space for
dominant topics in a corpus of unstructured documents.
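
Returning to the Anchored Least Stress projection of Stage Four, the closed-form placement can be sketched in a few lines. The intermediate vector y_i is not legible in this copy of the text, so the code assumes y_ij = ½(‖c_j‖² − (1/n)Σ_k‖c_k‖² − d_ij + d̄_i), which is what expanding equation [3] and subtracting the average of equation [4] gives once the centroids are centered as in equation [2]. That assumption, the function names, and the choice to report the result in the original (uncentered) coordinate system are all illustrative rather than taken from the source.

```python
def center_centroids(centroids):
    """Shift 2-D centroids so their center of mass is at the origin."""
    n = len(centroids)
    m1 = sum(c[0] for c in centroids) / n
    m2 = sum(c[1] for c in centroids) / n
    return [(c[0] - m1, c[1] - m2) for c in centroids], (m1, m2)


def place_document(centroids, sq_dists, own_cluster=None, weight=1.0):
    """Weighted least-squares 2-D position of one document.

    centroids   -- 2-D coordinates of the n projected cluster centroids
    sq_dists    -- squared high-dimensional distances to each centroid
    own_cluster -- index of the document's own cluster (optional)
    weight      -- extra weight given to the own-cluster distance
    """
    n = len(centroids)
    c, (m1, m2) = center_centroids(centroids)
    norms = [a * a + b * b for a, b in c]
    mean_norm = sum(norms) / n
    d_bar = sum(sq_dists) / n
    # Assumed form of y_i (see lead-in): the target value of c_j . x_i.
    y = [0.5 * (norms[j] - mean_norm - sq_dists[j] + d_bar) for j in range(n)]
    # Diagonal of S_j: 1 everywhere except the own-cluster entry.
    s = [weight if j == own_cluster else 1.0 for j in range(n)]
    # Normal equations (C^T S C) x = C^T S y, solved for the 2x2 case.
    a11 = sum(s[j] * c[j][0] * c[j][0] for j in range(n))
    a12 = sum(s[j] * c[j][0] * c[j][1] for j in range(n))
    a22 = sum(s[j] * c[j][1] * c[j][1] for j in range(n))
    b1 = sum(s[j] * c[j][0] * y[j] for j in range(n))
    b2 = sum(s[j] * c[j][1] * y[j] for j in range(n))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("centroids are collinear; position is not determined")
    x1 = (a22 * b1 - a12 * b2) / det
    x2 = (a11 * b2 - a12 * b1) / det
    # Undo the centering so the result is in the original coordinates.
    return (x1 + m1, x2 + m2)


# Hypothetical check: distances measured from the point (1, 1) itself.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
position = place_document(centroids, [2.0, 10.0, 10.0])
weighted = place_document(centroids, [2.0, 10.0, 10.0], own_cluster=0, weight=5.0)
```

When the supplied squared distances are exactly realizable in the plane, as in this toy case, the least-squares residuals vanish and both calls recover the same point. With real high-dimensional distances the weight shifts the compromise toward the document's own cluster centroid. The cost per document is a handful of sums over the n centroids, which matches the complexity claim made for the method.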

ThemeScapes solves the two most troublesome problems
encountered with two-dimensional textual information
analysis. First, important subjects of the database are not
easily or accurately discernible: the major topics are
imprecisely displayed, if provided at all, and are not
spatially organized to support the spatial organization of the 2D
document display. Secondly, documents are not readily
associated with the main topics which they contain.
Similarity between documents is conveyed through proximity,
but the relationship between documents and topics is
indeterminate. How closely a particular document is
associated with a topic, or how a pair of documents is topically
related, is difficult or impossible to determine.

First, the regional topics, or terms, and the set
of documents which contain them must be identified. The
gisting features of the text engine will identify the major
topics of a corpus of documents. While commercially
available text engines provide the gisting feature, such text
engines fail to provide a local, spatial representation of the
theme, a composite measure of theme, a quantitative
measure of theme or document by document measure of theme.
A clustering of the n-dimensional document vectors
(produced in stage three clustering) will result, and the
clusters 10 are projected into 2D space so that each
document has an assigned x,y coordinate pair, as illustrated in
FIG. 1. For each of these clusters, a set of terms is selected
which are both "topical" in nature, as measured by serial
clustering, and maximally discriminating between clusters, as
measured by the product of the frequency of the term within the
documents of a particular cluster and the frequency of the
term in all other clusters. The general form of the topic
equation is

term value_{n,i} = [right-hand side illegible in source]   [7]

with

f_{term n/cluster i} = frequency of term n in cluster i

\sum f_{term n/cluster j} = frequency of term n in all other clusters

and the highest value topics are selected.

The terms derived using this equation are the terms which 
best discriminate clusters from one another. A number of 
terms or topics for each cluster are automatically and 
heuristically selected, with topic value, frequency, cluster 
size, desired number of terms per cluster and per document 
collection all considered in the selection process. Each term 
or topic layer represents the distributed contribution of a 
single term/topic to the surface elevation of a "theme scape". 
Topic layer thickness may vary over the area of the
simulated landscape based on the probability of finding a
specified term within a document at each two dimensional
coordinate. After all the individual layers have been
computed, a composite layer is derived by summing each of
the term layers. A topic layer is thickest where the density of
documents that contain that term is highest. In areas where
there are few documents or few documents that contain a 
given term, the topic layer is very thin. High ground on the 
theme scape represents regions where there is an alignment 
of terms in underlying documents — or a common theme 
among proximal documents. Regions that are lower and less 
pronounced reflect documents that are more general in their 
content and less focused on a single theme. 
Each region or cluster is then characterized by a set of
terms or topics. Associated with each topic for each cluster
is a document set. The document set is nothing more than the
result of a Boolean query with the topic as the keyword. The
first stage of ThemeScape construction is complete when
both regional topics and their corresponding document sets
are identified.

The second stage of ThemeScapes development, formation
of the three-dimensional surface for individual topics
identified above, requires that a smoothing filter be run over the
x,y coordinates of the document display. This process is
analogous to operations such as edge detection or feature
enhancement in image processing. As illustrated in FIGS. 2
and 3, individual points 22 along the x-axis indicate the
location of a document in the topic's document set. A
smoothing function is run across each point, creating a z
coordinate associated with the term layer for each x,y pair,
represented as surface 24 above the x-axis. The equation for
calculating the y coordinate corresponding to each x
coordinate will be of the form

$$y_x = \sum_{n=-m}^{m} d_{x+n}\, f(x+n) \qquad [8]$$

with

d_{x+n} = 1 for document present at coordinate x+n, else 0
f(x+n) = the value of the smoothing function at x+n
2m = width of the smoothing function centered about x.

The two dimensional calculation of a ThemeScape as
illustrated in FIG. 3 utilizes a two dimensional grid of
documents and a two dimensional smoothing function,
producing a third dimension reflecting the probability of
finding a document with the given topic in the given vicinity.
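
The one-dimensional smoothing of FIGS. 2 and 3 can be sketched as below. The sketch reads the smoothing function as a kernel centered on x and evaluated at offset n, and it uses a triangular kernel; both the reading and the kernel shape are assumptions, since the text does not name a specific function.

```python
def triangular(offset, m):
    """Hypothetical kernel: 1 at the center, falling linearly toward 0."""
    return max(0.0, 1.0 - abs(offset) / (m + 1))


def smooth_1d(occupied, length, m, kernel=triangular):
    """Smoothed height at each integer x in range(length).

    occupied -- set of integer coordinates where a document is present
    m        -- half-width of the smoothing window (window width is 2m)
    """
    heights = []
    for x in range(length):
        total = 0.0
        for n in range(-m, m + 1):
            d = 1 if (x + n) in occupied else 0  # document present at x+n?
            total += d * kernel(n, m)
        heights.append(total)
    return heights


# Hypothetical topic document set: documents at x = 2, 3 and 8.
heights = smooth_1d({2, 3, 8}, length=10, m=2)
```

The two adjacent documents at x = 2 and x = 3 reinforce each other and produce the highest part of the curve, while the isolated document at x = 8 produces a lower, separate bump.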

Finally, all individual topic ThemeScapes are
superpositioned. The individual elevations from each term layer are
added together to form a single terrain corresponding to all
topics. Thus,

$$z(x,y) = \sum_{t=1}^{\#\text{ of cluster terms}} z_t(x,y) \qquad [9]$$

where z_t(x,y) is the elevation of term layer t at the
coordinate x,y.

Generally, normalization of the above equation is
performed.
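
A minimal sketch of the superposition step on a two-dimensional grid follows. The Gaussian bump used for each document and the peak-to-one normalization are assumptions chosen for illustration; the text states only that a two dimensional smoothing function is applied and that normalization is generally performed.

```python
import math


def term_layer(doc_xy, width, height, sigma=1.0):
    """Elevation grid for one term: a Gaussian bump per document, summed."""
    layer = [[0.0] * width for _ in range(height)]
    for gy in range(height):
        for gx in range(width):
            layer[gy][gx] = sum(
                math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2 * sigma ** 2))
                for x, y in doc_xy
            )
    return layer


def composite(layers):
    """Add term layers cell by cell, then scale so the highest cell is 1."""
    height, width = len(layers[0]), len(layers[0][0])
    total = [[sum(layer[r][c] for layer in layers) for c in range(width)]
             for r in range(height)]
    peak = max(max(row) for row in total)
    if peak == 0:
        return total
    return [[value / peak for value in row] for row in total]


# Hypothetical topics with the x,y positions of their document sets.
topic_docs = {
    "nuclear weapons": [(1, 1), (2, 1)],
    "health physics": [(5, 4)],
}
layers = [term_layer(xy, width=7, height=6) for xy in topic_docs.values()]
terrain = composite(layers)
```

In the resulting `terrain`, the region holding two nearby documents on the same topic stands higher than the region holding a single document, which mirrors the statement that a layer is thickest where documents containing the term are densest.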

The result of this computation is a "landscape" that
conveys large quantities of relevant information. The terrain
simultaneously communicates the primary themes of an
arbitrarily large collection of documents and a measure of
their relative magnitude. Spatial relationships defined by the
landscape reveal the intricate interconnection of themes, the
existence of information gaps or negative information. For
example, FIG. 4 illustrates a "theme scape" 40 of a database
with 200 documents and 50 themes. In this data set, themes
had relatively small document sets (a low number of
documents contained in each theme), but high theme
discrimination values (the documents were clustered close to the
theme location). More prominent peaks are characteristic of
the high discrimination values, as for example peak 42
representing "nuclear weapons" and peak 44 representing
"health physics".

FIG. 5 represents a database with the same number of
documents and themes as in FIG. 4; however, the themes
have relatively large document sets and low theme
discrimination values, as at peak 52 representing "lasers" and peak
54 representing "genetics".

Therefore, the ThemeScape function of the present inven- 
tion can be summarized as follows: 
(i) receive n-dimensional context vector from text engine
for each document and cluster documents in n-dimensional
space;

(ii) for each such cluster, receive from text engine asso- 
ciated gisting terms or topics; 

(iii) creating a list of topics for each cluster; 

(iv) creating global keyword list by combining the topics 
for each cluster and eliminating common terms (such as a, 
and, but, the); 

(v) performing keyword query on topic, producing a list 
of documents associated with the topic; 

(vi) identifying coordinates for all documents associated 
with the topic, producing a matrix of retrieved documents in 
the x,y display coordinates; 

(vii) applying a smoothing function to each x,y pair, 
producing a z coordinate associated with the topic for each 
x,y pair; and 

(viii) repeating steps (v) and (vi) for each term in the list 
identified in step (iv). 

An embodiment of the present invention is shown in FIG. 
6. The embodiment provides a method of determining and 
displaying the relative content and context of a number of 
related documents in a large document set. The relationships 
of a plurality of documents are presented in a three- 
dimensional landscape with the relative size and height of a 
peak in the three-dimensional landscape representing the 
relative significance of the relationship of a topic, or term, 
and the individual document in the document set. The steps 
of the process are shown in steps 602 through 614 of FIG.
6, including: (a) constructing an electronic database of a 
plurality of documents to be analyzed (step 602); (b) creat- 
ing a plurality of high dimensional vectors, one for each of 
the plurality of documents, such that each of the high 
dimensional vectors represents the relative relationship of 
the individual documents to the term, or topic attribute (step 
604); (c) arranging the high dimensional vectors into 
clusters, with each of the clusters representing a plurality of 
documents grouped by relative significance of their relation- 
ship to a topic attribute (step 606); (d) calculating centroid 
coordinates as the center of mass of each cluster, the centroid 
coordinates being stored or projected in a two-dimensional 
plane (step 608); (e) constructing a vector for each 
document, with each vector containing the distance from the 
document to each centroid coordinate in high-dimensional 
space (step 610); (f) creating a plurality of term (or topic) 
layers, each of the term layers corresponding to a descriptive 
term (or topic) applied to each cluster, and identifying x,y 
coordinates for each document associated with each term 
layer (step 612); and (g) creating a z coordinate associated 
with each term layer for each x,y coordinate by applying a
smoothing function to the x,y coordinates for each
document, and superposing upon one another all of the term
layers (step 614).
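
Steps (d) and (e) can be illustrated with a minimal sketch. It assumes Euclidean distance, which the text does not specify, and the function names and sample vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Center of mass of a cluster: the component-wise mean of its vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def distance_vector(doc, centroids):
    """Distances from one document vector to every cluster centroid.

    Euclidean distance is an assumption of this sketch.
    """
    return [math.dist(doc, c) for c in centroids]

# Hypothetical three-dimensional document vectors grouped into two clusters.
clusters = {
    "cluster_a": [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
    "cluster_b": [[0.0, 4.0, 0.0]],
}
centroids = [centroid(members) for members in clusters.values()]
distances = distance_vector([2.0, 0.0, 0.0], centroids)
```

The resulting list holds one distance per centroid, so a document near one cluster and far from the others has one small entry and several large ones.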

It will be apparent to those skilled in the art that various
modifications can be made to the methods disclosed herein
for producing a three-dimensional representation of a
database, without departing from the scope or spirit of the
invention, and it is intended that the present invention cover
modifications and variations of the methods claimed herein
to the extent they come within the scope of the appended
claims and their equivalents.
We claim: 

1. A method of determining and displaying the relative
content and context of a number of documents in a large
document set, wherein the relationships of a plurality of
documents are presented in a three-dimensional landscape
with the relative size and height of a peak in the
three-dimensional landscape representing the relative significance
of the relationship of a topic attribute and the individual
documents in the document set, comprising the steps of:

(a) building an electronic database of a plurality of 
documents; 

(b) creating a plurality of high dimensional vectors, one 
for each of said plurality of documents such that each 
of said high dimensional vectors represents the relative 
relationship of the individual documents to the topic 
attribute; 

(c) arranging said high dimensional vectors into clusters, 
each of said clusters representing a plurality of docu- 
ments grouped by the relative significance of their 
relationship to a topic attribute; 

(d) calculating centroid coordinates as the center of mass 
of each cluster, the centroid coordinates being stored or 
projected in a two-dimensional plane; 

(e) constructing a vector for each document, said vector 
containing the distance from the document to each 
centroid coordinate in high-dimensional space; 

(f) creating a plurality of term layers, each of said term
layers corresponding to a descriptive term applied to
each cluster, and identifying x,y coordinates for each
document associated with each term layer; and

(g) creating a z coordinate associated with each term layer
for each x,y coordinate by applying a smoothing function
to the x,y coordinates for each document, and
superimposing upon one another all of said term layers.

* * * * *
ABSTRACT

A computer information processing system utilizes parallel
processors for organizing and clustering a large number of
documents into a large number of clusters for information
analysis and retrieval. After the documents are translated
into electronic digital documents, each document is converted
into a vector based on a weighted list of the occurrence
of different words and terms that appear in the document.
The document vectors are grouped together into cluster
vectors on different parallel processors according to
similarities. New document vectors are simultaneously compared
with existing cluster vectors in the different parallel
processors.

1 Claim, 9 Drawing Sheets 



Step 1: Document Vector 1 forms Cluster 1 on Processor 1.

Step 2: Document Vector 2 is compared to Cluster 1 and may form Cluster 2 on
Processor 2.

Step 3: Document Vector 3 is compared to Cluster 1 and Cluster 2
simultaneously and may form Cluster 3 on Processor 3. (Process repeats for all
documents.)
FIG. 1a

PRIOR ART

Document is "data base management system"

Form term signatures:

data        0000 0010 0000 1000
base        0100 0010 0000 0000
management  0000 0100 0001 0000
system      0000 0000 0101 0000

FIG. 1b

PRIOR ART

Form document signature by ORing each column: 0100 0110 0101 0000
FIG. 2a

PRIOR ART

Keyword-Boolean matrix for keywords APPLE, ORANGE, BANANA, GRAPE
(matrix of 0/1 entries: rows are the terms APPLE, BANANA, GRAPE, ORANGE;
columns are the documents D0, D1, D2, ...)
FIG. 2b

PRIOR ART

Inverted index for same terms:

Terms    Documents

APPLE    D0, D2, D4, D6

BANANA   D3, D5, D7

GRAPE    D2, D6, D8, D9

ORANGE   D1, D2, D3, D4, D5
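
As a minimal sketch of how an inverted index like the one in FIG. 2b can answer a Boolean query, the listing can be held as a mapping from each term to its set of documents. The document lists are taken from the figure; the helper names and the sample query are hypothetical.

```python
# Inverted index as listed in FIG. 2b: term -> set of documents containing it.
inverted_index = {
    "APPLE": {"D0", "D2", "D4", "D6"},
    "BANANA": {"D3", "D5", "D7"},
    "GRAPE": {"D2", "D6", "D8", "D9"},
    "ORANGE": {"D1", "D2", "D3", "D4", "D5"},
}

def boolean_and(index, *terms):
    """Documents that contain every one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents that contain at least one of the given terms."""
    return set().union(*(index.get(term, set()) for term in terms))

def boolean_not(candidates, index, term):
    """Remove from the candidate set every document containing the term."""
    return candidates - index.get(term, set())

# Hypothetical query: APPLE AND ORANGE NOT GRAPE
hits = boolean_not(boolean_and(inverted_index, "APPLE", "ORANGE"),
                   inverted_index, "GRAPE")
print(sorted(hits))  # ['D4']
```

A term that is absent from the index simply yields an empty set, which mirrors the limitation that non-keywords cannot be queried.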
FIG. 3

Step 1: Document Vector 1 forms Cluster 1 on Processor 1.

Step 2: Document Vector 2 is compared to Cluster 1 and may form Cluster 2 on
Processor 2.

Step 3: Document Vector 3 is compared to Cluster 1 and Cluster 2
simultaneously and may form Cluster 3 on Processor 3. (Process repeats for all
documents.)
FIG. 5: Times to perform clustering for [D,TA,C,1] (series T1_32, T1_64, T1_128)

FIG. 7A: Times to form clusters and speedup for [D,TA,C,4]

FIG. 7B

FIG. 8A

FIG. 9A
PARALLEL DOCUMENT CLUSTERING PROCESS

FIELD OF INVENTION

The present invention relates to a clustering process to
organize a vast amount of text-based documents into related
groups of generally accepted clusters for subsequent
query-driven retrieval and information analysis.

REFERENCES

Publications

[CHEN92] Chen, H., K. Lynch (1992) "Automatic Construction
of Networks of Concepts Characterizing Document Databases,"
IEEE Transactions on Systems, Man, and Cybernetics, 22(5),
pp 885-902.

[DAHL92] Dahlhaus, E. (1992) "Fast Parallel Algorithm
for the Single Link Heuristics of Hierarchical Clustering,"
Proceedings of the Fourth IEEE Symposium on Parallel and
Distributed Processing, pp 184-187.

[JAIN88] Jain, A. K. and R. C. Dubes (1988), Algorithms
for Clustering Data, Englewood Cliffs, N.J.: Prentice Hall.

[LI90] Li, X. (1990) "Parallel Algorithms for Hierarchical
Clustering and Cluster Validation," IEEE Transactions on
Pattern Analysis and Machine Intelligence, 12(11), pp
1088-1092.

[MURT92] Murtagh, F. (1992), "Comments on 'Parallel
Algorithms for Hierarchical Clustering and Cluster
Validation'," IEEE Transactions on Pattern Analysis and
Machine Intelligence, 14(10), pp 1056-1057.

[SALT83] Salton, Gerald and Michael J. McGill (1983),
Introduction to Modern Information Retrieval, New York,
N.Y.: McGraw-Hill.

[SALT89] Salton, Gerald (1989), Automatic Text Processing,
Reading, Mass.: Addison-Wesley.

[TIBE93] Tiberio, P. and P. Zezula (1993), "Selecting
Signature Files for Specific Applications," Information
Processing and Management, 29(6), pp 487-498.

DESCRIPTION OF DIAGRAMS

A complete appreciation of the invention and its advantages
can be attained by reference to the background leading
to the invention and the summary and detailed description of
the invention when considered in conjunction with the
accompanying drawings:

FIG. 1 shows the formation of a signature file.

FIG. 2 shows the formation of a keyword-Boolean matrix
and inverted index.

FIG. 3 provides a general overview of the invention's
operation.

FIG. 4 provides a generalized result of the invention.

FIG. 5 shows time to form clusters on 1 processor.

FIG. 6 shows time to form clusters and speedup for 2
processors.

FIG. 7 shows time to form clusters and speedup for 4
processors.

FIG. 8 shows time to form clusters and speedup for 16
processors.

FIG. 9 shows times to form clusters and speedup across
processors.

Table 1 is the number of clusters for each permutation of
the Wall Street Journal Set.

Table 2 is the time to make the clusters with 1 processor.

Table 3 is the time to make the clusters with 2 processors.

Table 4 is the time to make the clusters with 4 processors.

Table 5 is the time to make the clusters with 16 processors.

Table 6 represents the speedup of 2 processors over 1
processor.

Table 7 represents the speedup of 4 processors over 1
processor.

Table 8 represents the speedup of 16 processors over 1
processor.

Table 9 represents the efficiency of using 2 processors.

Table 10 represents the efficiency of using 4 processors.

Table 11 represents the efficiency of using 16 processors.

BACKGROUND OF INVENTION

Information retrieval is the process of retrieving documents
which are relevant to a query generated by a user.
There are four dominant models for information retrieval:
full-text scanning, keyword-Boolean operations, signature
files, and vector-space. The models can be characterized as
either formatted or unformatted. An unformatted model
leaves each document exactly as it appears in the document
text. In other words, the document does not undergo any
processing into a different form. Formatted models process
the documents into a separate form or data structure. Query
processing is done with separate forms or data structures.

One model is full-text scanning. This model takes a
phrase from the user and searches each document in a
collection for an exact match. This model is unformatted,
hence each document must be completely scanned with each
query. There are some efficient algorithms for text scanning.
However, this model, in general, is only used in small
document sets where the user is familiar with the documents,
i.e. personal files.

Formatted models take a document and convert it to
some other form. Usually, this form is based on an ability to
discriminate among documents. One way to discriminate is
to identify words within the documents and build the
formatted structure from discriminating words. These words
are frequently referred to as descriptors. In many cases,
descriptors are reduced to word stems. In other words, terms
like "engineer", "engineering", or "engineered" would all
reduce to "engineer".

A signature file is formed from the signatures of a
document set. To form a signature, each term in a document
is converted to a standard size term consisting of 0's and 1's.
Once each term is converted, the document signature is
formed by performing an ORing operation. FIG. 1 details
the formation of a document signature. To perform a query,
the query is converted to a query signature. This signature is
compared to each document signature. A match occurs if a
"1" in the query signature can be aligned with a "1" in a
document signature. Unfortunately, if there is a match it is
unknown if the match occurs because the document matches
the query or simply because the process of conversions
contributed to the alignment. A document which is retrieved
due to alignment based on the conversion process yet does
not match the actual query is called a "false hit". Signature
files are used as a filtering process which retrieves
potential matches, with some other process used to screen
false hits from the retrieved set.

Keyword-Boolean models typically are represented by a
matrix where the columns represent documents and the rows
represent keywords. If a keyword appears in a document
then the row corresponding to the term and the column
corresponding to the document has a "1" placed in it. FIG.
2a shows a typical Boolean model matrix. A query is
formulated using the Boolean operators AND, OR, NOT.
Each term in the query is compared to the matrix. If a "1"
appears in the [row, column] corresponding to the [term,
document] that document is placed into a set of potential
documents. Once documents are placed in the set the Boolean
operations are performed. In effect the set of potential
documents is pared down to the final set presented to the
user. However, for each query, each term must be tested for
in each document. Searching can be made more efficient by
the use of an inverted index. An inverted index for
information retrieval is much the same as the index of a textbook.
It identifies keywords and indicates which documents contain
those keywords. FIG. 2b shows a scheme for an inverted
index. While improving searching time, the indices
themselves can be very large. Keyword-Boolean systems are
dependent on the indexing scheme. If a term is not a
keyword, there is no way to query for that term within
documents. If a new word is added to the index, each
document may need to be re-examined to identify if the
document contains the new term. Assigning the keywords as
descriptors of documents is a complex problem [TIBE93]. It
has been shown that the probability of two people using the
same descriptors to categorize a document is low (10-20%)
[CHEN92].

The last model to be discussed is the vector-space model.
In this model, each document is considered to be a collection
of terms. The number of times each word occurs is considered
that term's weight. The terms and weights for each
document can be organized into a document vector consisting
of n dimensions, where n is the number of terms. A
query is also a collection of terms. A query can be converted
into a query vector and the query vector is compared to the
document vector. Document and query vectors are compared
using vector operations and if some threshold of similarity
is reached, the document is considered for retrieval. Typical
measures of similarity are the Cosine, Dice, or Jaccard
similarity measures.

The formatted models (signature files, keyword-Boolean
operations, vector-space) process the documents to form
different data structures. These data structures ignore the
relationships between the words in a document. If word
concordance is a concern, some additional data structure
must be maintained. Word count, or the number of times a
given word occurs in a document, is lost in signature files
and in Boolean models. Word count can be considered as
inherent in the value of attributes in the vector-space model.

All the models have been described in terms of how
documents compare to a query. However, none of the
models describe how documents relate to each other. The
signature file model, the text scanning model, and the
Boolean models simply do not have capabilities to relate
documents to each other. The vector-space model, with its
ability to measure similarity, is readily capable of determining
how documents compare to each other. This ability
forms the underlying premise of the idea of clustering.

The basic idea of clustering is that items which are similar
can be grouped together. Each group, or cluster, has a
centroid vector, which is the average of the document
vectors making up the cluster. Query vectors are compared
to the cluster centroids; if a match is determined, the entire
cluster is retrieved. Clusters can reduce the search requirement
as each centroid contains information about numerous
documents.

Clustering has been used in retrieval systems. However, it
has always been done in a sequential fashion. As new
documents are added to a collection they are not immediately
available to the user. Instead they are held until such
time as a sufficient number of new documents have been
acquired. Then, in an off line process, the new documents
and old documents go through the clustering process. This
results in a new set of cluster centroids. The new set of
cluster centroids is eventually provided to the user as a new
[illegible] version. As document sets increase, the amount of
time between versions increases as well. However, if a
document set doubles, it takes four times as long to form the
clusters [illegible] unacceptable time delay [illegible]
document. The Parallel Document
Clustering Process can greatly reduce this time.

SUMMARY OF INVENTION

It is the object of this invention to provide an efficient
method of organizing a large body of documents into
clusters for subsequent retrieval.

It is a further object of the invention to provide a fast
method of clustering to reduce the amount of time a document
waits from its receipt into the document set until it is
available for retrieval by users.

A more specific object of the invention is to perform the
process on any available multiple processor (parallel)
machine.

The invention is a process for forming clusters of large
document sets using multiple processor machines. The process
is useful for identifying similarity amongst documents
and for subsequent use in document retrieval systems. For
demonstration purposes, the Single-Pass Heuristic is used as
the clustering method. This method is generally accepted
within the information retrieval community.

Text documents are converted to document vectors. Clusters
of documents are represented by cluster vectors which
have the same format as document vectors. A cluster vector
is made by taking the numerical average of all the documents
which comprise the cluster. A document vector is
compared to each cluster vector. As soon as a determination
is made that the document is similar to a cluster, the
comparisons stop and the document is added to that cluster.
In the event a document is not similar to any cluster, a new
cluster is formed.

The invention follows the same tenets of forming a
document vector which is compared to cluster vectors.
However, the key innovation is that instead of comparing a
document vector to a single cluster, it compares the document
vector simultaneously with P clusters, where P is the
number of processors within the system. Each processor is
responsible for creating and maintaining certain clusters.
Each processor uses the same document vector. The processor
compares this document vector to each of its clusters.
When a sufficient threshold of similarity is exceeded, the
processor records which cluster vector is similar to the
document vector. When all processors have compared the
document vector, a decision is reached as to which
cluster the document vector is added to. If a determination
is made that the document was not similar to any existing
cluster, a processor is designated to create a new cluster.

By way of example, the process begins by converting the
first document vector into a cluster vector. This vector is
stored in processor 1. The next document is compared to this
vector. Assuming the threshold is not exceeded, the process
forms cluster 2 on processor 2. The third document is now
compared simultaneously with cluster 1 and 2. If the document
cannot be added to either cluster, then cluster 3 is
formed on processor 3. The next document is now compared
simultaneously with cluster 1, cluster 2 and cluster 3. This
process continues until all documents in the set are placed
into appropriate clusters. FIG. 3 shows this process. While
there is no theoretic limit to the number of processors, a
claim of the invention is that it performs on realistic, existing
machines. Therefore, each processor may be responsible for
more than one cluster. FIG. 4 shows the overall effect of the
process on a machine with 4 processors.

DETAILED PROCESS 
Each text document is scanned to create a document
vector. Stop words are removed and the number of times
each word appears in a document is counted. Each unique
word in a document is given an administrative number based
on a master dictionary. The document vector is basically a
set of paired numbers. The first number of the first pair
indicates the total number of words in the document and the
second number of the pair represents the number of unique
words within the document. Each subsequent pair is made up
of the number of appearances of a word and the administrative
number of that word. Formation of the document vector is
done off line and is not part of the invention. Each cluster
has a uniquely designated global number. The cluster centroid
vector has the same form as a document vector, that is, a
series of paired numbers. The first pair represents the total
number of words in the cluster and the total number of unique
words in the cluster. Each subsequent pair is made up of the
number of appearances of a word and the administrative number
of that word. The cluster vector is formed as the mathematical
average of the document vectors within the cluster. The
following formula shows how each element of a document
vector is added to its corresponding element in a cluster
vector:



$$C_{i,j}^{\text{new}} = \frac{N \cdot C_{i,j} + Y_i}{N + 1} \qquad (1)$$

$C_{i,j}$ = Weight of appearances of word i within cluster $C_j$
$Y_i$ = Weight of appearances of word i within document Y
N = Number of documents within cluster $C_j$
Weight = (appearances of term i) / (total words of cluster or document)

Clusters are formed based on a similarity measure
between the cluster centroid and a document vector. One
popular similarity measure is the COSINE measure based on
the formula:

$$\mathrm{COSINE}(C_j, Y) = \frac{\sum_i C_{i,j}\, Y_i}{\sqrt{\sum_i C_{i,j}^2 \; \sum_i Y_i^2}} \qquad (2)$$

$C_{i,j}$ = weight of ith term of cluster $C_j$
$Y_i$ = weight of ith term of document Y

If COSINE($C_j$, Y) exceeds a threshold (ranging from 0 to 1),
the document and cluster are considered to be similar. The
document is added to the cluster using equation 1. Every
cluster C is compared to document Y until either the
threshold is exceeded or all clusters have been checked. In
the latter case, a new cluster is formed. Forming a new
cluster is very simple. The document vector is simply
redesignated as a cluster and given a number. To put the
threshold into perspective, a threshold of 0 implies that any
document will match a cluster. A threshold of 1 implies that
only documents which have exactly the same words and
exactly the same number of appearances of those words will
be considered similar.

Each processor accesses the same document vector at the
same time. Using equation (2), a processor determines if a
document is similar to any of its clusters. It is possible for
a document vector to be similar to more than one cluster.
However, the tenets of the Single-Pass method are such that
a document is only placed into the first cluster which
exceeds the threshold. For the invention to produce the same
results as the Single-Pass, there must be communications
among the processors to determine which cluster was the
first to have the threshold exceeded. It is important to note
that the use of the term "first cluster" is based on the order
the clusters were formed. Cluster 1 is formed before cluster
2, which is formed before cluster 3, etc. Since the clusters
are numbered globally, the cluster with the lowest global
number is determined to be the cluster which incorporates
the document. So if processor 2 reports the threshold was
exceeded with cluster 7 and processor 5 reports that the
threshold was exceeded on cluster 5 then the document is
added to cluster 5. This assignment occurs even if the actual
computations by processor 2 were finished ahead of processor
5. If no cluster is found, the system generates a new
cluster and one of the processors takes responsibility for the
cluster. Clusters are apportioned to processors in a modulus
fashion, that is:

$$P_{\text{cluster } i} = i \bmod P_t \qquad (3)$$

Where:

$P_{\text{cluster } i}$ = Processor designated for cluster i
i is the cluster being formed
$P_t$ is the total number of processors within the system
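
The similarity test, the centroid update, and the modulus assignment described above can be combined in a short sketch. Several things here are assumptions rather than the patent's implementation: vectors are stored as sparse `{word_number: weight}` dictionaries, equation (1) is taken to be a running average (the text describes the cluster vector as the average of its documents), an ordinary loop stands in for processors working simultaneously, and all function names are hypothetical.

```python
import math

def cosine(c, y):
    """Equation (2): cosine between two sparse {word_number: weight} vectors."""
    dot = sum(w * y[t] for t, w in c.items() if t in y)
    norm = math.sqrt(sum(w * w for w in c.values()) *
                     sum(w * w for w in y.values()))
    return dot / norm if norm else 0.0

def add_to_cluster(centroid, n, doc):
    """Equation (1), read as a running average over n existing documents."""
    terms = set(centroid) | set(doc)
    return {t: (n * centroid.get(t, 0.0) + doc.get(t, 0.0)) / (n + 1)
            for t in terms}

def single_pass(docs, threshold, num_processors):
    clusters = {}    # global cluster number -> (centroid, document count)
    owner = {}       # global cluster number -> processor, per equation (3)
    assignment = []  # cluster chosen for each document, in input order
    for doc in docs:
        # Each "processor" reports the lowest-numbered cluster of its own
        # that exceeds the threshold; this loop emulates parallel execution.
        reports = []
        for p in range(num_processors):
            mine = sorted(k for k in clusters if owner[k] == p)
            hit = next((k for k in mine
                        if cosine(clusters[k][0], doc) > threshold), None)
            if hit is not None:
                reports.append(hit)
        if reports:
            k = min(reports)  # lowest global cluster number wins
            centroid, n = clusters[k]
            clusters[k] = (add_to_cluster(centroid, n, doc), n + 1)
        else:
            k = len(clusters) + 1  # redesignate the document as a new cluster
            clusters[k] = (dict(doc), 1)
            owner[k] = k % num_processors
        assignment.append(k)
    return assignment, owner

# Hypothetical weighted document vectors.
docs = [{1: 1.0}, {2: 1.0}, {1: 1.0, 2: 0.1}]
print(single_pass(docs, threshold=0.5, num_processors=2))
# ([1, 2, 1], {1: 1, 2: 0})
```

Because the winner is chosen by global cluster number rather than by which processor finishes first, the assignment list is the same for any value of `num_processors`.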

The invention was used with a set of articles from the Wall 
Street Journal. The articles were written at different times of 
the year, by different reporters, and cover an assortment of 
topics. They are also of varying lengths. In other words, the 
document set is not categorized into any preestablished 
category, such as, computer oriented, financially oriented, 
legally oriented, etc. The documents were formed into 
individual document vectors and each is stored as a separate 
file with a unique file name. There is a master file which 
contains the file names of the document vectors. When the 
vectors were formed stop words were removed but there was 
no stemming. Stemming can be easily incorporated into the 
scheme if desired as all conversion from text format to 
vector format is outside the scope of the invention. 

There are several parameters which must be established 
prior to executing the process. These are: 

a) D, the number of documents in the document set. 

b) O, the order in which document vectors are accessed 
for comparison. 

c) C, the threshold value of the COSINE coefficient.

d) P, the number of processors in the actual system.

The Single-Pass method places a document into the first
cluster which exceeds the threshold. This has two very
characteristic results. The first characteristic is that the
clusters which are formed first tend to be larger (contain
more documents) than clusters which are formed later in the
process. The second characteristic is that the process is
order-dependent. That means if the order in which the
document vectors are compared changes, the clusters may
change as well. These two characteristics are known and
generally accepted as a by-product of the Single-Pass
method. To show the generality of the invention, results are
based on five different orderings of the document set. The
ordering of the document vectors is done through manipulation
of the master file of document vectors.

The first runs were done with a random ordering (R) of the
documents. Basically, this can be considered as taking the
documents in the order in which they arrived to the system.
While the document vectors are formed, little is known
about their content. However, once a document vector is
formed, it is very easy to identify two characteristics. The
first characteristic is the total number of words in the vector
and the second characteristic is the number of unique words
in the document vector. These values are provided by the
first numbered pair of the document vector. With this
information, it is possible to arrange the document vectors
based on total words or number of unique words. This being
the case, the document vectors were ordered by total words
in ascending order (TA) and descending order (TD). Ordering
was also done based on number of unique words in
ascending order and descending order, UA and UD,
respectively.
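
A minimal sketch of the paired-number document vector and the five orderings follows. The stop-word list, the dictionary that assigns administrative numbers on first sight, and the choice to count total words after stop-word removal are all assumptions made for illustration; the function names are hypothetical.

```python
import random
from collections import Counter

STOP_WORDS = {"a", "and", "but", "the", "of", "to"}  # illustrative only

def document_vector(text, dictionary):
    """Return [(total_words, unique_words), (count, word_number), ...].

    `dictionary` plays the role of the master dictionary; here it assigns
    the next free administrative number to each unseen word (an assumption).
    """
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(words)
    pairs = [(n, dictionary.setdefault(w, len(dictionary) + 1))
             for w, n in counts.items()]
    return [(len(words), len(counts))] + sorted(pairs, key=lambda p: p[1])

def order_vectors(vectors, scheme, seed=0):
    """Order vectors by their first pair: R, TA, TD, UA or UD."""
    if scheme == "R":
        shuffled = list(vectors)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    field = 0 if scheme[0] == "T" else 1   # total words or unique words
    return sorted(vectors, key=lambda v: v[0][field],
                  reverse=(scheme[1] == "D"))

dictionary = {}
v1 = document_vector("the cat and the cat sat", dictionary)
v2 = document_vector("dog", dictionary)
print(v1)  # [(3, 2), (2, 1), (1, 2)]
print(order_vectors([v1, v2], "TA") == [v2, v1])  # True
```

Only the first pair of each vector is inspected when ordering, which matches the observation that these two values are available as soon as a vector is formed.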

The threshold coefficient is the basis of comparison
between a document and a cluster. If the cosine exceeds this
threshold, the document and cluster are deemed to be
similar. The coefficient can range from 0.0 to 1.0. The higher
the value, the more similar a document must be in order to
be added to a cluster. While the range can be 0.0 to 1.0,
values at the extremes have little meaning. A value of 0.0
implies that there will be a single cluster incorporating all
the documents. A value of 1.0 would basically filter all the
documents and only cluster those which are duplicates. The
invention was run on coefficients of 0.2, 0.4, 0.6, and 0.8.

The intent of the invention is to be used with large
document sets. To show the effect of the invention as
document set size increases, document sets of 32, 64 and 128
were used. These sets were run at each of the four coefficients
and in each of the five different orders.

The last parameter to be established is the number of
processors to use. While there is no theoretic limit to the
number of processors, the intent of the invention is to show
the effectiveness on a real, practical, and commercially
available machine. The process was run initially on 1
processor. This indicates the time the process would take on
a regular, sequential machine. The process was run on
machines consisting of 1, 2, 4, and 16 processors. A run then
consists of a combination of the four parameters [D, O, C,
P]. In total, 240 permutations were run, and each permutation
was run 5 times. Results are based on the average of
these runs.
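
The count of 240 permutations follows from the parameter values given above (3 document set sizes, 5 orderings, 4 coefficients, 4 processor counts). A short sketch that enumerates the [D, O, C, P] combinations, with hypothetical constant names:

```python
from itertools import product

DOCUMENT_SET_SIZES = [32, 64, 128]          # D
ORDERINGS = ["R", "TA", "TD", "UA", "UD"]   # O
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]           # C
PROCESSORS = [1, 2, 4, 16]                  # P

runs = list(product(DOCUMENT_SET_SIZES, ORDERINGS, THRESHOLDS, PROCESSORS))
print(len(runs))   # 240
print(runs[0])     # (32, 'R', 0.2, 1)
```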

The initial runs were used to establish the number of
clusters. In general, the fewer the clusters, the broader the
scope of those clusters. The more clusters there are, the more
specific the scope of the clusters. Correspondingly, there
tend to be more clusters at higher values of the coefficient.
There is no direct link between selecting a coefficient and the
desire to form a specific number of clusters. Table 1 provides
the results of the clustering process itself. It is evident that
the value of the coefficient is the major determining factor in
the number of clusters formed. In the Single-Pass method,
the number of processors used has no effect on the number
or composition of the clusters formed. In other words, the 6
clusters formed with the 32 document set, in random order
with a coefficient of 0.2, are the same whether the process
was run on 1 processor or 16 processors. It is interesting to
note that the ordering schemes TA and UA tend to provide the
most clusters while the schemes TD and UD provide the
fewest. It should be made clear that it has yet to be determined
if there is a combination of [D, O, C] which results in an
optimal clustering of any document set.



Tables 2, 3, 4, and 5, represent the time it takes to do the 
process on 1, 2, 4, and 16 processors. Table 2 is the process 
time on one processor. As such, it reflects the process time 
on the typical sequential machine. This is the basis for 
comparison as more processors are added to a system. The 
time it takes to perform the process on multiple processors 
has two main components. The first component is the time 
it takes to do actual processing, e.g. mathematical opera- 
tions. The second component is the time it takes to do 
internal message passing. The times depicted in tables 3, 4, 
and 5 are the combined total of the two components for each 
[D, O, C, P ] combination. The effects of the components 
will be explained in accompanying narrative. There are three 
factors which are used in evaluating the process. The first is 
simply the time to perform the algorithm. The second factor 
is speedup which indicates how much faster an algorithm 
runs as more processors are added. It is a value ranging from 
1 to P, the number of processors. The third factor is
efficiency, or how much effort each processor is contributing
to the accomplishment of a task. Efficiency is a value from 0 to
1. In general, the higher the speedup and efficiency, the better
an algorithm performs. It is also generally accepted that the 
best speedup which can be attained is when speedup is the 
same as the number of processors and the best that efficiency 
can be is 1. Speedup and efficiency can be mathematically 
defined by: 



$$\text{(a)}\;\; S_P = \frac{T_1}{T_P} \qquad\qquad \text{(b)}\;\; \eta_P = \frac{S_P}{P} \qquad (4)$$

$S_P$ = Speedup with P processors
$T_1$ = Time with 1 processor
$T_P$ = Time with P processors
$\eta_P$ = Efficiency with P processors
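
Equation (4) can be written as two small functions. The timings in the example are hypothetical values chosen for easy arithmetic and are not taken from the tables.

```python
def speedup(t1, tp):
    """Equation (4a): time on 1 processor divided by time on P processors."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Equation (4b): speedup divided by the number of processors."""
    return speedup(t1, tp) / p

# Hypothetical timings: 2.0 s on 1 processor, 1.25 s on 2 processors.
print(speedup(2.0, 1.25))        # 1.6
print(efficiency(2.0, 1.25, 2))  # 0.8
```

A run that takes longer on P processors than on one gives a speedup below 1 and a correspondingly low efficiency.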

Table 2 indicates the time, in seconds, it takes to form the
clusters for each configuration using 1 processor. This
becomes the basis for comparison of the process as more
processors are added. Each time represents a combination of
the four parameters [D, O, C, P], so conclusions are
described in terms of isolating parameters. As D, O, and P
are held constant, it is clear that increasing C causes an
increase in execution time. For example, [32,R,0.2,1] takes
0.937 seconds whereas [32,R,0.4,1] takes 1.354 seconds,
[32,R,0.6,1] takes 2.361 seconds and [32,R,0.8,1] takes
2.495 seconds. In general, it can be seen that the time
increases as C increases. In other words, it takes longer to
make more clusters, which is intuitive. It is also intuitive that
it takes one processor longer to perform the algorithm as the
number of documents increases. This is evident in holding O,
C, P constant and allowing D to change from 32 to 64 and
then to 128.

Table 3 is the time to perform the algorithm for each
permutation with two processors, or [D, O, C, 2]. It can be
seen that for permutations where C=0.2, the algorithm does
worse than with 1 processor. This is to be expected given the
nature of the Single-Pass method. As previously stated, the
Single-Pass method tends to put documents into the first
clusters, so the first processor is initially doing work while
the second processor is idle. This accounts for poor speedup and
efficiency as well. In effect, only one of the processors is
working. Note, however, that as the value of C increases, so
does the number of clusters. As these clusters are
apportioned, there is more work to do and each processor
takes on a more even distribution of the load. This is
especially seen as both D and C increase.
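
The load imbalance described above can be illustrated with a small sketch. It assumes, purely for illustration, that new clusters are handed to processors in round-robin order as they are created; the function name and the cluster counts are hypothetical and are not taken from the tables.

```python
def clusters_per_processor(num_clusters: int, num_processors: int) -> list:
    """Count the clusters held by each processor under round-robin assignment."""
    counts = [0] * num_processors
    for cluster_index in range(num_clusters):
        counts[cluster_index % num_processors] += 1
    return counts


# Low C: few clusters, so the second processor holds nothing and sits idle.
print(clusters_per_processor(1, 2))  # [1, 0]
# Higher C: more clusters, so the load spreads more evenly.
print(clusters_per_processor(7, 2))  # [4, 3]
```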

Table 4 shows the times to perform clustering with four
processors, and Table 5 represents the results of using 16
processors. Tables 6 through 8 reflect the speedup for P=2,
4, and 16, respectively. By definition, there is no speedup
when only one processor is used. The values in these tables
were derived from equation (4)(a). Equation (4)(b) was used
to determine the efficiency of the invention as the number of
processors increases. The results are available for 2, 4, and
16 processors in Tables 9 through 11. A graphic representation
of the results is presented in FIGS. 5 through 9. For
clarity, the results of O=TA have been isolated.

To show the effects of the invention over a range of
document set sizes, various graphs are superimposed over
the same set of axes. To allow for superimposing the graphs,
a special notation has been adopted. "T" represents time, the
digit represents the number of processors, and the number
following the underscore represents the size of the document
set. So T1_32 represents the time needed for 1 processor
operating on 32 documents. In a similar fashion, T16_128
represents the time for 16 processors operating on 128
documents. The use of "C" along the bottom axis represents
the value of the coefficient (0.2 through 0.8). The
subscripted letter, j, serves as a counter to ensure the alignment
of results of the document sets with the appropriate value
of the coefficient. "S" is used to represent speedup. S32
represents speedup for 32 documents, S64 is speedup for 64
documents, etc. Again, the subscripted letter, j, ensures
proper alignment of speedup based on the document sets and
the coefficient. In the final figure, the "P" represents the
number of processors used, with "j" aligning results for each
value of "P".
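
The time-label notation can be decoded mechanically. The following is a minimal sketch, assuming labels of the form T, processor count, underscore, document count (for example T16_128); the helper name `parse_time_label` is hypothetical.

```python
import re

# "T" + number of processors + "_" + size of the document set
LABEL = re.compile(r"^T(\d+)_(\d+)$")


def parse_time_label(label: str) -> tuple:
    """Split a label such as 'T16_128' into (processors, documents)."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"not a time label: {label!r}")
    return int(match.group(1)), int(match.group(2))


print(parse_time_label("T1_32"))    # (1, 32)
print(parse_time_label("T16_128"))  # (16, 128)
```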

FIG. 5 shows the time to cluster the document sets for
each value of C and D. As intuition would indicate, it takes
longer to cluster a larger number of documents into a larger
number of clusters. The significance of FIG. 5 is in how it
compares to the performance as the number of processors
increases. The subsequent figures represent time with various
values of P. The speedup chart which accompanies each time
chart reflects the performance of a given value of P compared
to P=1.

FIG. 6 displays the results of using 2 processors. The chart
shows a decrease in time using the 2 processors versus using
1 processor. The speedup chart reflects how much faster the
invention is performing. To determine the speedup, select the
coefficient of interest, for example 0.6. Draw a line straight
up until that line intersects with the line representing the
document set of interest, for example, S128. From the
intersection point draw a line straight to the left. The point
where this line intersects the vertical axis is the speedup.
With the values of C=0.6 and D=128 (S128), the speedup is
1.79. In other words, [128,TA,0.6,2] is 1.79 times faster than
[128,TA,0.6,1].
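
As a quick arithmetic check, equation (4)(b) can be applied to the speedup read from the chart. A speedup of 1.79 on 2 processors gives an efficiency of 1.79 / 2 = 0.895, which agrees with the value listed in the two-processor efficiency table for 128 documents, order TA, and a coefficient of 0.6. The variable names in this sketch are illustrative only.

```python
speedup_128_ta_06 = 1.79  # read from the speedup chart for [128,TA,0.6,2]
processors = 2

# Equation (4)(b): efficiency is speedup divided by the number of processors.
efficiency_128_ta_06 = speedup_128_ta_06 / processors
print(round(efficiency_128_ta_06, 3))  # 0.895
```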

The next figures represent results for P=4 and P=16. In all
cases it can be seen that the speedup is better as D and C
increase. It is clear from FIG. 8 that there is a marked
increase in the performance of the invention as D, C, and P
have increased.

FIG. 9 isolates O=TA and C=0.8 to better show the time to
perform the clustering and the invention's speedup across an
increase in P. FIG. 9 shows several things. First, it is clear
that speedup will decrease if D is constant and P is allowed
to increase. Clearly, the invention takes less time as the
number of processors increases. However, note that the curve
for D=128 continues to decrease. It is important to note that
speedup is increasing as both D and P increase. For example,
the D=32 curve has a breakpoint at P=4. This breakpoint
indicates that it takes longer to perform the clustering as the
number of processors increases. This increase occurs because the
document set is not large enough to keep each processor
productively employed. Also, as the number of processors
increases, so too does the amount of time spent passing
messages. The breakpoint is an indicator of when the
messaging overhead takes longer than the processing time to
perform the clustering. The breakpoint is not yet reached for
D=64, but the curve is flattening out. The figure clearly shows
that D=128 is still rising at P=16, which indicates P can increase
and still provide reasonable speedup performance. More
importantly, the figure shows two significant trends. The first
is that performance is better if D increases while P is held
constant. The other trend is the increased performance of the
algorithm for increases in D and P. These results clearly
show the invention does as stated: that is, efficient clustering
of large document sets in a practical parallel environment.
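
The breakpoint behavior can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the work divides perfectly across processors and each additional processor adds a fixed messaging cost. The model, the function names, and every number below are hypothetical and are not derived from the measured results.

```python
def modeled_time(work: float, processors: int, message_cost: float) -> float:
    """Toy model: perfectly divided work plus a fixed messaging cost per extra processor."""
    return work / processors + message_cost * (processors - 1)


def best_processor_count(work: float, message_cost: float, candidates) -> int:
    """Return the candidate processor count with the lowest modeled time."""
    return min(candidates, key=lambda p: modeled_time(work, p, message_cost))


candidates = [1, 2, 4, 8, 16]
# Small workload: messaging dominates early, so extra processors stop paying off.
print(best_processor_count(1.0, 0.05, candidates))  # 4
# Larger workload: the same overhead is amortized, so more processors still help.
print(best_processor_count(8.0, 0.05, candidates))  # 16
```

In this model the breakpoint moves to a higher processor count as the workload grows, which mirrors the qualitative trend described for larger document sets.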
Table 1 represents number of clusters based on number of documents, order
of documents and coefficient of similarity.
Table 2 represents the time required to form clusters using P = 1. Time
includes the processing time and the messaging time.
Table 3 represents the time required to form clusters using P = 2. Time
includes the processing time and the messaging time.
Table 4 represents the time required to form clusters using P = 4. Time
includes the processing time and the messaging time.
Table 5 represents the time required to form clusters using P = 16. Time
includes the processing time and the messaging time.
Table 6 represents the speedup attained using P = 2.
Table 7 represents the speedup attained using P = 4.
Table 8 represents the speedup attained using P = 16.
Documents  Order  Coef = 0.2  Coef = 0.4  Coef = 0.6  Coef = 0.8
32         R      0.420       0.532       0.766       0.818
           TA     0.343       0.704       0.816       0.822
           TD     0.403       0.672       0.813       0.826
           UA     0.359       0.721       0.779       0.825
           UD     0.433       0.642       0.803       0.825
64         R      0.438       0.664       0.900       0.843
           TA     0.414       0.770       0.871       0.908
           TD     0.444       0.700       0.883       0.898
           UA     0.463       0.758       0.857       0.887
           UD     0.450       0.753       0.883       0.900
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5,864,855 



-continued

Documents  Order  Coef = 0.2  Coef = 0.4  Coef = 0.6  Coef = 0.8
128        R      0.611       0.729       0.934       0.906
           TA     0.449       0.641       0.895       0.959
           TD     0.465       0.694       0.873       0.923
           UA     0.449       0.605       0.872       0.937
           UD     0.465       0.735       0.887       0.948



Table 9 represents the efficiency attained using P = 2.



-continued

D          O      C
Documents  Order  Coef = 0.2  Coef = 0.4  Coef = 0.6  Coef = 0.8
128        R      0.056       0.249       0.386       0.404
           TA     0.049       0.167       0.336       0.390
           TD     0.058       0.218       0.388       0.428
           UA     0.049       0.152       0.319       0.395
           UD     0.058       0.205       0.391       0.447



D          O      C
Documents  Order  Coef = 0.2  Coef = 0.4  Coef = 0.6  Coef = 0.8
32         R      0.194       0.334       0.543       0.573
           TA     0.174       0.461       0.567       0.594
           TD     0.213       0.386       0.566       0.596
           UA     0.154       0.451       0.555       0.538
           UD     0.212       0.361       0.586       0.590
64         R      0.222       0.482       0.730       0.691
           TA     0.205       0.582       0.707       0.738
           TD     0.223       0.443       0.753       0.752
           UA     0.215       0.621       0.698       0.729
           UD     0.223       0.554       0.747       0.753
128        R      0.229       0.525       0.806       0.777
           TA     0.223       0.488       0.781       0.862
           TD     0.229       0.489       0.799       0.850
           UA     0.220       0.476       0.758       0.854
           UD     0.231       0.538       0.809       0.868


Table 10 represents the efficiency attained using P = 4.




D          O      C
Documents  Order  Coef = 0.2  Coef = 0.4  Coef = 0.6  Coef = 0.8
32         R      0.040       0.068       0.121       0.126
           TA     0.027       0.089       0.114       0.117
           TD     0.046       0.081       0.130       0.139
           UA     0.026       0.077       0.115       0.114
           UD     0.045       0.079       0.136       0.143
64         R      0.051       0.166       0.244       0.254
           TA     0.041       0.164       0.210       0.232
           TD     0.55        0.154       0.240       0.276
           UA     0.044       0.157       0.204       0.229
           UD     0.054       0.162       0.262       0.286






Table 11 represents the efficiency attained using P = 16.

We claim: 

1. In an arrangement of parallel processors in a computer
information processing system, a parallel clustering method
for examining preselected documents and grouping similar
documents in the parallel processors for subsequent retrieval
in an electronic digital format from the computer information
processing system, the steps comprising:
converting each preselected document into an electronic
document in digital format;
converting each electronic document into a vector,
whereby a vector is a weighted list of the occurrence of
different words and terms that appear in the document;
selecting a first electronic document and designating the
vector of the first electronic document as a first cluster
vector whereby a cluster vector is the mathematical
average of all of the document vectors having similar
characteristics, and assigning the first cluster vector to
a first processor of the parallel processors;
selecting a second electronic document and comparing the
vector of the second electronic document with the first
cluster vector to determine if the second document
vector has similar characteristics, and assigning the
second document vector to the first cluster vector if
they have similar characteristics or designating the
second document vector as a second cluster vector and
assigning the second cluster vector to a second processor
of the parallel processors if there are different
characteristics; and
selecting each subsequent electronic document and comparing
the vector of each subsequent electronic document
with all existing cluster vectors simultaneously on
each processor having a cluster vector, and assigning
each subsequent document vector to a parallel processor
having the most similar characteristics or designating
the subsequent document vector as a subsequent
cluster vector and assigning the subsequent cluster
vector to a processor of the parallel processors if there
are different characteristics.
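
The steps recited above can be sketched as a sequential simulation. This is a minimal illustration, not the claimed implementation: cosine similarity, the similarity threshold, and round-robin assignment of new clusters to processors are all assumptions made for the example, and the names `cosine` and `single_pass` are hypothetical. The comparisons that would run simultaneously on separate processors are performed here in a single loop.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def single_pass(doc_vectors, threshold, num_processors):
    """Assign each document to its most similar cluster, or start a new cluster."""
    clusters = []  # each entry: members, centroid (cluster vector), processor
    for vec in doc_vectors:
        # In a parallel setting each processor would compare against its own clusters.
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(vec)
            n = len(best["members"])
            # The cluster vector is the average of its member document vectors.
            best["centroid"] = [sum(col) / n for col in zip(*best["members"])]
        else:
            clusters.append({
                "members": [vec],
                "centroid": list(vec),
                "processor": len(clusters) % num_processors,  # assumed round-robin
            })
    return clusters


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
result = single_pass(docs, threshold=0.8, num_processors=2)
print([len(c["members"]) for c in result])  # [2, 1]
print([c["processor"] for c in result])     # [0, 1]
```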